2024-04-16
No warm-up today! There will be a bit more input today…
Today (and next week), we will try to cover a lot!
“Models are generally simplifications of things in the real world that nonetheless convey the essence of the thing being modeled”
“All models are wrong but some are useful” (G. Box)
(ST21, Ch 5)
Aim: Find the model that most efficiently and accurately summarizes the way in which the data were actually generated.
Basic structure of statistical models:
\[ data=model+error \]
In general, we want to predict single observations (denoted by i) from the model. The fact that we are looking at predictions of observations and not actual values of the data is denoted by the “hat”:
\[ \widehat{data_i} = model_i \] The error is then simply the deviation of the actual data from the predicted values:
\[ error_i = data_i - \widehat{data_i} \] If this doesn’t make much sense yet, let’s look at an example.
Let’s say we want to have a model of height of children (in the NHANES dataset also used in ST21).
What do you think would be a good model?
The simplest model would be… the mean of the height values! This would imply that the model would predict the same height for everyone - and all individual deviations would be part of the error term.
We can write such a simple model as a formula:
\[ y_i = \beta + \epsilon \]
\(y_i\) denotes the individual observations (hence the \(i\)) of heights, \(\beta\) is a so-called parameter, and \(\epsilon\) is the error term. In this example, the parameter \(\beta\) would be the same value (= the mean height) for everyone (hence it doesn’t need an \(i\) subscript). Parameters are values that we estimate to find the best model.
How do we find parameters that belong to the best fitting model?
We try to minimize the error!
Remember, the error is the difference between the actual and predicted values of \(y\) (height):
\[ error_i = y_i - \hat{y_i} \]
If we select a predicted value of 400 cm, all individuals’ errors would be huge (because no one is 4 m tall). If we average these errors, the result would still be a big value.
A better candidate for such a simple model is thus the arithmetic mean or average:
\[ \bar{X} = \frac{\sum_{i=1}^{n}x_i}{n} \]
Summing up all individuals’ heights and dividing that number by the number of individuals gives us the mean. By definition (see the book for a proof: the individual errors cancel out), the average error is now 0!
We usually don’t simply average across the individual errors, but across the squared errors.
The reason is that positive and negative errors cancel each other out, which is not the case when squared.
The mean squared error would be in a different unit than the data (squared!), which is why we usually take the square root of that value to bring it back to the original unit. This leaves us with the root mean squared error (RMSE)!
Obviously, the model for predicting height from the average is not very good: It predicts the same height for all children! (The RMSE is 27 cm!)
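A minimal sketch of this in R, using simulated heights (the actual NHANES values differ, so the RMSE will not be exactly 27 cm):

```r
# simulated children's heights in cm (hypothetical values, not NHANES)
set.seed(1)
height <- rnorm(100, mean = 150, sd = 27)

# the simplest model: predict the mean for everyone
pred  <- rep(mean(height), length(height))
error <- height - pred

mean(error)                  # essentially 0: errors around the mean cancel out
rmse <- sqrt(mean(error^2))  # root mean squared error
rmse
```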
How can we improve this model?
We can account for other information that we might have!
For example, to account for age might be a good idea: Older children are likely taller than younger ones. We plot height against age to visually inspect the relationship:
As we can see, the line (~ the model) fits the data points much better once we include a constant (or intercept) and age. We would write this model as the following formula:
\[ \hat{y_i} = \hat{\beta_0} + \hat{\beta_1} * age_i \]
Remember from linear algebra that this defines a line:
\[ y = slope * x + intercept \]
Thus \(\beta_0\) is the parameter for the intercept and \(\beta_1\) for the slope of age!
The model fit is now much better: RMSE = 8.36 cm.
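A sketch of such a model in R, with simulated age/height data (not the actual NHANES sample, so the estimates differ from the RMSE quoted above):

```r
# hypothetical linear growth: height increases with age
set.seed(2)
age    <- runif(200, min = 2, max = 17)
height <- 80 + 6 * age + rnorm(200, sd = 8)

fit <- lm(height ~ age)       # estimates beta_0 (intercept) and beta_1 (age slope)
coef(fit)                     # intercept and slope
sqrt(mean(residuals(fit)^2))  # RMSE of the fitted model
```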
Adding gender? It does not improve the model much!
Two aims:
Describe data well (= low error/RMSE)
Generalize to new data (low error when applied to new data)
Can be conflicting!
Where does error come from? (In addition to individual differences!)
measurement error (noise): random variation in data
actual measurement is biased (broken device, bias etc.)
“thing measured” may be biased/varies a lot
wrong model specification
e.g. height goes down with age
important variable is missing from model (age!)
Simulated relationship between blood alcohol content and reaction time on a driving test, with best-fitting linear model represented by the line. A: linear relationship with low measurement error. B: linear relationship with higher measurement error. C: Nonlinear relationship with low measurement error and (incorrect) linear model
Yes! This is called overfitting.
If we fit a line too closely to the data, the model might not be able to generalize to other data well.
Why summarize data?
It’s a model & describes the data! E.g. the mean = central tendency of the data
Mean, Median, Mode?
Mean = minimizes sum of squared error, but highly influenced by outliers!
Median = “middle” value if ranked, minimizes sum of absolute error, less influenced by extreme values
Mode = most often occurring value
Example:
If 3 people earn 10,000 Euros per year and 1 person earns 1,000,000:
Mean: 257,500 Euros
Median: 10,000 Euros (ranked: 10,000; 10,000; 10,000; 1,000,000 → middle value)
Mode: 10,000 Euros
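The income example in R. Note that base R has no mode function for data, so the `stat_mode` helper below is a small hypothetical convenience function:

```r
# the income example from above
income <- c(10000, 10000, 10000, 1000000)

mean(income)    # pulled up strongly by the outlier
median(income)  # robust to the extreme value

# hypothetical helper: the most frequent value
stat_mode <- function(x) as.numeric(names(which.max(table(x))))
stat_mode(income)
```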
How widespread are the data?
Variance and Standard Deviation
Variance ≊ Mean Squared Error
\[ \sigma^2 = \frac{SSE}{N} = \frac{\sum_{i=1}^n (x_i - \mu)^2}{N} \]
(Note: \(x_i\) = value of an individual observation, \(\mu\) = population mean, as opposed to \(\bar{X}\) = sample mean)
Standard Deviation ≊ Root Mean Squared Error
\[ SD = \sigma = \sqrt{\sigma^2} \]
We usually don’t know the population mean \(\mu\), that’s why we estimate the sample variance (with the “hat”):
\[ \hat\sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar{X})^2}{n-1} \]
Note: we now use \(\bar{X}\) (the sample mean) and \(n\) for the sample size. \(n-1\) is used to make the estimate less biased.
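In R, `var()` and `sd()` already use the \(n-1\) (sample) formula, which we can verify by hand:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sum((x - mean(x))^2) / (n - 1)  # sample variance, matches var(x)
sum((x - mean(x))^2) / n        # "population" formula gives a smaller value
var(x)
sd(x)                           # square root of var(x)
```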
A related concept is the Z-score, which expresses a value in units of standard deviations from the mean:
\[ Z(x) = \frac{x - \mu}{\sigma} \]
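A quick check in R that standardization yields mean 0 and SD 1 (here using the sample estimates of \(\mu\) and \(\sigma\); `scale()` does the same and returns a matrix):

```r
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)  # z-scores using sample mean and SD
z
as.numeric(scale(x))        # scale() computes the same values

mean(z)  # ~0 after standardization
sd(z)    # exactly 1
```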
Let’s skip this (for now?) - but you should know that there are different probability distributions (normal, uniform etc.) and that we draw samples from the population when we run experiments…
(Also, some classical probability theory and resampling methods like Monte Carlo simulations are helpful to know).
We also skip Null Hypothesis Significance Testing, which is the main approach we’re using in psychology etc., as well as Confidence Intervals, Effect Sizes… You should have a rough idea of these concepts.
Remember the basic model of statistics:
\[ data = model + error \]
Our general goal is to find the model with the best fit, i.e. that minimizes the error.
One approach is the GLM. You might be surprised that a lot of the common models can be viewed as linear models:
Dependent variable (DV): The outcome variable that the model aims to explain (\(Y\)).
Independent variable (IV): The variable that we use to explain the DV (\(X\)).
Linear model: The model for the DV is composed of a linear combination of IVs (that are multiplied by different weights!)
The weights are the parameters \(\beta\) and determine the relative contribution of each IV. (This is what the model estimates! The weights thus give us the important information we’re usually interested in: How strong are IV and DV related.)
There may be several DVs, but usually that’s not the case and we will focus on those cases with one DV!
Let’s use some simulated data:
We can calculate the correlation between the two variables:
Pearson's product-moment correlation
data: df$grade and df$studyTime
t = 2, df = 6, p-value = 0.09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.13 0.93
sample estimates:
cor
0.63
The correlation is quite high (.63), but the CI is also pretty wide.
Fundamental activities of statistics:
Describe: How strong is the relationship between grade and study time?
Decide: Is there a statistically significant relationship between grade and study time?
Predict: Given a particular amount of study time, what grade do we expect?
Use the GLM (~synonymous to linear regression) to…
describe the relation between two variables (similar to correlation)
predict DV for new values of IV (new observations)
add multiple IVs!
Simple GLM:
\[ y = \beta_0+ x * \beta_x + \epsilon \]
\(\beta_0\) = intercept, the overall offset of the line when \(x=0\) (even if that is impossible)
\(\beta_x\) = slope, how much do we expect \(y\) to change with each change in \(x\)?
\(y\) = DV
\(x\) = IV or predictor
\(\epsilon\) = error term, whatever variance is left over once the model is fit, residuals! (Think of the model as the line that is fitted and the residuals are the vertical deviations of the data points from the line!)
(If we refer to predicted \(y\)-values, after we have estimated the model fit/line, we can drop the error term: \(\hat{y} = \hat{\beta_0} + x * \hat{\beta_x}\).)
There is a close relation between correlation and regression, and we can convert \(r\) to \(\hat{\beta_x}\):
\(\hat{r} = \frac{covariance_{xy}}{s_x * s_y}\)
\(\hat{\beta_x} = \frac{covariance_{xy}}{s_x*s_x}\)
\(covariance_{xy} = \hat{r} * s_x * s_y\)
\(\hat{\beta_x} = \frac{\hat{r} * s_x * s_y}{s_x * s_x} = \hat{r} * \frac{s_y}{s_x}\)
→ Regression slope = correlation multiplied by the ratio of SDs (if the SDs are equal, \(\hat{r} = \hat{\beta_x}\))
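We can verify this identity on simulated data:

```r
# checking beta = r * s_y / s_x on simulated data
set.seed(3)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

r          <- cor(x, y)
slope_lm   <- coef(lm(y ~ x))[["x"]]   # slope from linear regression
slope_conv <- r * sd(y) / sd(x)        # slope converted from the correlation

slope_lm
slope_conv  # identical up to floating point
```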
We usually want to make inferences about the regression parameter estimates. For this we need an estimate of their variability.
We first need an estimate of how much variability is not explained by the model: the residual variance (or error variance):
Compute residuals:
\[ residual = y - \hat{y} = y - (x*\hat{\beta_x} + \hat{\beta_0}) \]
Compute Sum of Squared Errors (remember?):
\[ SS_{error} = \sum_{i=1}^n{(y_i - \hat{y_i})^2} = \sum_{i=1}^n{residuals^2} \]
Compute Mean Squared Error:
\[ MS_{error} = \frac{SS_{error}}{df} = \frac{\sum_{i=1}^n{(y_i - \hat{y_i})^2} }{N - p} \]
where the \(df\) are the number of observations \(N\) - the number of estimated parameter \(p\) (in this case 2: \(\hat{\beta_0}\) and \(\hat{\beta_x}\)).
Finally, we can calculate the standard error for the full model:
\[ SE_{model} = \sqrt{MS_{error}} \]
We can also calculate the SE for specific regression parameter estimates by rescaling the \(SE_{model}\):
\[ SE_{\hat{\beta_x}} = \frac{SE_{model}}{\sqrt{\sum{(x_i - \bar{x})^2}}} \]
With the parameter estimates and their standard errors, we can compute \(t\)-statistics, which compare the observed estimate to its expected value under \(H_0\) (usually 0, i.e. no effect), scaled by its standard error.
\[ \begin{array}{c} t_{N - p} = \frac{\hat{\beta} - \beta_{expected}}{SE_{\hat{\beta}}}\\ t_{N - p} = \frac{\hat{\beta} - 0}{SE_{\hat{\beta}}}\\ t_{N - p} = \frac{\hat{\beta} }{SE_{\hat{\beta}}} \end{array} \]
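The steps above can be recomputed by hand on simulated data and compared to what `summary(lm(...))` reports:

```r
set.seed(4)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)

n <- length(y)
p <- 2                                         # two parameters: intercept and slope
ms_error <- sum(residuals(fit)^2) / (n - p)    # MS_error with df = N - p
se_model <- sqrt(ms_error)                     # = residual standard error
se_beta  <- se_model / sqrt(sum((x - mean(x))^2))
t_beta   <- coef(fit)[["x"]] / se_beta         # t against H0: beta = 0

# matches the Std. Error and t value columns of summary(fit):
summary(fit)$coefficients["x", c("Std. Error", "t value")]
c(se_beta, t_beta)
```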
Usually, we would just let R do the calculations:
Call:
lm(formula = grade ~ studyTime, data = df)
Residuals:
Min 1Q Median 3Q Max
-10.656 -2.719 0.125 4.703 7.469
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.16 5.16 14.76 6.1e-06 ***
studyTime 4.31 2.14 2.01 0.091 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.4 on 6 degrees of freedom
Multiple R-squared: 0.403, Adjusted R-squared: 0.304
F-statistic: 4.05 on 1 and 6 DF, p-value: 0.0907
The intercept is significantly different from zero (which is usually not very relevant) and the effect of studyTime is not significant. So for every additional hour of studying, the expected effect on the grade is rather small (about 4.3 points) and possibly not reliably present.
\(t\) ratio of \(\beta\) to its \(SE\)!
intercept: expected grade without studying at all
Often, it is useful to check how good the model we estimated fits the data.
We can do that easily by asking how much of the variability in the data is accounted for by the model?
If we only have one IV (\(x\)), then we can simply square the correlation coefficient:
\[ R^2 = r^2 \]
In the study time example, \(R^2\) = 0.4 → we accounted for 40% of the overall variance in grades!
More generally, we can calculate \(R^2\) from the sums of squares:
\[ R^2 = \frac{SS_{model}}{SS_{total}} = 1-\frac{SS_{error}}{SS_{total}} \]
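A check on simulated data that both routes give the same \(R^2\):

```r
set.seed(5)
x <- rnorm(60)
y <- 3 + 0.8 * x + rnorm(60)
fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)      # total variability in y
ss_error <- sum(residuals(fit)^2)     # variability left over after the model
r2 <- 1 - ss_error / ss_total

r2
cor(x, y)^2               # same value with a single IV
summary(fit)$r.squared    # and what R reports
```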
\(R^2\) is also called the coefficient of determination, a goodness-of-fit (GoF) statistic!
A small \(R^2\) tells us that even though a model might be significant, it may only explain a small amount of the variance in the DV.
Often we want to know the effects of multiple variables (IVs) on some outcome.
Example:
Some students have taken a very similar class before, so there might not only be the effect of studyTime on grades, but also of having taken a priorClass.
We can build a model that takes both into account by simply adding the “weight” and the IV (priorClass) to the model:
\(\hat{y} = \hat{\beta_1}*studyTime + \hat{\beta_2}*priorClass + \hat{\beta_0}\)
For priorClass, i.e. whether each individual has taken a previous class or not, we use dummy coding (0 = no, 1 = yes). The model assumes the same effect of studyTime across data points, regardless of whether someone has taken a class before. If we plot the data, we can see that both IVs seem to have an effect on grades:
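A minimal sketch with simulated data (not the book’s dataset), showing dummy coding of a binary IV in a two-predictor model:

```r
set.seed(6)
n <- 40
studyTime  <- runif(n, 1, 10)
priorClass <- rbinom(n, 1, 0.5)   # dummy coded: 0 = no prior class, 1 = yes
# hypothetical true effects: +3 points per hour, +8 points for a prior class
grade <- 70 + 3 * studyTime + 8 * priorClass + rnorm(n, sd = 5)

fit <- lm(grade ~ studyTime + priorClass)
coef(fit)  # beta_0 (intercept), beta for studyTime, beta for priorClass
```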
How can we tell from the plot that both IVs might have an effect?
We previously assumed that the effect of studyTime on grade was the same for both groups - but sometimes we expect that this regression slope differs per group!
This is what we call an interaction: The effect of one variable depends on the value of another variable.
Example: What is the effect of caffeine on public speaking?
There doesn’t seem to be an effect:
# perform linear regression with caffeine as independent variable
lmResultCaffeine <- lm(speaking ~ caffeine, data = df)
summary(lmResultCaffeine)
Call:
lm(formula = speaking ~ caffeine, data = df)
Residuals:
Min 1Q Median 3Q Max
-33.10 -16.02 5.01 16.45 26.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.413 9.165 -0.81 0.43
caffeine 0.168 0.151 1.11 0.28
Residual standard error: 19 on 18 degrees of freedom
Multiple R-squared: 0.0642, Adjusted R-squared: 0.0122
F-statistic: 1.23 on 1 and 18 DF, p-value: 0.281
What if we find research suggesting that anxious people react differently to caffeine than non-anxious people?
Let’s include anxiety in the model:
# compute linear regression adding anxiety to model
lmResultCafAnx <- lm(speaking ~ caffeine + anxiety, data = df)
summary(lmResultCafAnx)
Call:
lm(formula = speaking ~ caffeine + anxiety, data = df)
Residuals:
Min 1Q Median 3Q Max
-32.97 -9.74 1.35 10.53 25.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.581 9.197 -1.37 0.19
caffeine 0.131 0.145 0.91 0.38
anxietynotAnxious 14.233 8.232 1.73 0.10
Residual standard error: 18 on 17 degrees of freedom
Multiple R-squared: 0.204, Adjusted R-squared: 0.11
F-statistic: 2.18 on 2 and 17 DF, p-value: 0.144
It looks like the effect of caffeine is indeed different for the two anxiety groups: Increasing for non-anxious people and decreasing for anxious ones.
However, the model is not significant!
This is due to the fact that we only look at additive effects (main effects) with this model. Overall, neither caffeine nor anxiety predicts speaking performance.
In other words: The model tries to fit the same slope for both groups, which results in a (roughly) flat line for the average caffeine effect and no difference between the means of the anxiety groups.
To allow for different slopes for each group (i.e. for the effect of caffeine to vary between the anxiety groups), we have to model the interaction as well.
The interaction is simply the product of the two variables:
# compute linear regression including caffeine X anxiety interaction
lmResultInteraction <- lm(
speaking ~ caffeine + anxiety + caffeine:anxiety,
# speaking ~ caffeine * anxiety, # same!
data = df
)
summary(lmResultInteraction)
Call:
lm(formula = speaking ~ caffeine + anxiety + caffeine:anxiety,
data = df)
Residuals:
Min 1Q Median 3Q Max
-11.385 -7.103 -0.444 6.171 13.458
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.4308 5.4301 3.21 0.00546 **
caffeine -0.4742 0.0966 -4.91 0.00016 ***
anxietynotAnxious -43.4487 7.7914 -5.58 4.2e-05 ***
caffeine:anxietynotAnxious 1.0839 0.1293 8.38 3.0e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.1 on 16 degrees of freedom
Multiple R-squared: 0.852, Adjusted R-squared: 0.825
F-statistic: 30.8 on 3 and 16 DF, p-value: 7.01e-07
We now see that there are significant main effects for both caffeine and anxiety, as well as the significant interaction between both variables. (We have to be careful of interpreting the main effects when an interaction is also significant!)
The interpretation of the coefficients when interactions are included is not as straightforward!
If you want to report the “typical” ANOVA table with main effects and the general interaction:
Analysis of Variance Table
Response: speaking
Df Sum Sq Mean Sq F value Pr(>F)
caffeine 1 455 455 6.96 0.0179 *
anxiety 1 992 992 15.17 0.0013 **
caffeine:anxiety 1 4593 4593 70.27 3e-07 ***
Residuals 16 1046 65
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of the coefficients (with dummy coding and an interaction):
intercept: intercept of the anxious (reference) group!
anxietynotAnxious: difference between the intercepts of the two groups
caffeine: slope for the anxious group only!
caffeine:anxietynotAnxious: difference in slopes between the groups
These coefficients are thus not main effects!
Sometimes, we want to compare two (nested!) models to see which one fits the data better.
We can do so by using the anova()* function in R:
Analysis of Variance Table
Model 1: speaking ~ caffeine + anxiety
Model 2: speaking ~ caffeine + anxiety + caffeine:anxiety
Res.Df RSS Df Sum of Sq F Pr(>F)
1 17 5639
2 16 1046 1 4593 70.3 3e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This shows that Model 2, incl. the interaction, is to be preferred.
Note: We can only use this method with nested models, which means that the simpler (reduced) model only contains variables also included in the more complex (full) model.
The Wald/F-test compares a ratio of squared errors to an \(F\)-distribution (sounds familiar from ANOVA?), while the likelihood-ratio test compares a ratio of likelihoods to a \(\chi^2\) distribution.
*Yes, it is kind of an ANOVA as well, in that (a ratio of) squared errors is compared to an \(F\)-distribution…
“Garbage in, garbage out” - we have to make sure our model is properly specified!
Properly specified = having included the appropriate IVs.
The model also needs to satisfy the assumptions of the statistical method (= GLM).
One important assumption of the GLM is that the residuals are normally distributed (NOT necessarily the data!).
This assumption can be violated by a not properly specified model or because the data are inappropriate for the statistical model.
We can use a Q-Q plot, which represents the quantiles of two distributions/variables (e.g. the data and a normal distribution of the same data) against each other.
If the data points diverge substantially from the line (especially in the extremes), we can conclude that the residuals are not normally distributed.
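In base R, a Q-Q plot of the residuals can be drawn with `qqnorm()` and `qqline()` (simulated data here):

```r
# fit a simple model on simulated data
set.seed(7)
x <- rnorm(100)
y <- 2 + x + rnorm(100)
fit <- lm(y ~ x)

res <- residuals(fit)
qqnorm(res)  # quantiles of residuals vs. theoretical normal quantiles
qqline(res)  # reference line; large deviations suggest non-normal residuals
```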
To check the assumptions, we can easily run a function for model diagnostics (incl. Q-Q plots) in R. The function, check_model(), is included in the performance package by the easystats team (who make great packages for everything related to statistical modeling!)
We’re not going into detail about all these diagnostics (and they are hard to see here!), but it is always a good idea to run diagnostics/check the assumptions of your models!
By “prediction” we neither mean “predicting before seeing the data/in the future” nor do we imply causality!
It simply refers to fitting a model to the data: We estimate (or predict) values for the DV (\(\hat{y}\)) and the IVs are often referred to as predictors.
Related to: predicting future values
As you saw earlier, the “normal” ANOVA is a special case of the linear model.
The difference is that the linear model is a bit more flexible in terms of including different IVs/predictors (continuous & categorical), whereas the ANOVA can only use categorical IVs (unless you run an ANCOVA).
An ANOVA is basically an F-Test that we have used previously (e.g. anova(modelfit)).
The F statistic is calculated by partitioning the variability into the Sum of Squares Within (SSW) and the Sum of Squares Between (SSB) categories/factor levels:
\[ SSW = \sum_{j=1}^{M}\sum_{i=1}^{n_j}{(y_{ij} - \bar{y}_j)^2} \]
SSW is the (squared and summed) difference between each data point and its category’s mean.
\[ SSB = \sum_{j=1}^{M}{n_j(\bar{y}_j - \bar{y})^2} \]
SSB is the (squared and summed) difference between each category’s mean and the overall mean (weighted by the group size \(n_j\)).
\[ F = \frac{SSB/(M-1)}{SSW/(N-M)} \]
M = number of categories/groups/factor levels; N = total number of participants/observations; \(n_j\) = number of observations in group \(j\)
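Computing \(F\) by hand on simulated data and comparing it to `anova()`:

```r
# M = 3 groups with N = 60 observations in total (simulated)
set.seed(8)
g <- factor(rep(c("a", "b", "c"), each = 20))
y <- rnorm(60) + c(a = 0, b = 1, c = 2)[g]   # hypothetical group offsets

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)

ssb <- sum(table(g) * (group_means - grand_mean)^2)  # between groups
ssw <- sum((y - group_means[g])^2)                   # within groups

M <- 3; N <- 60
F_manual <- (ssb / (M - 1)) / (ssw / (N - M))
F_manual
anova(lm(y ~ g))[["F value"]][1]  # same value
```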
Next time?
Problem: Dependent measures!